Introduction

This report describes a within-axonal differential expression (DE) analysis of Axonseq data provided by Jimmy Beckers using the Bioconductor R package DESeq2. DESeq2 estimates the negative binomial distribution for each gene in the dataset to model RNA-seq count data, then conducts DE analysis to identify genes that are significantly differentially expressed between experimental conditions.

This report performs the analysis of the following conditions:

  • Comparison 1: C9-1 vs C9-1iso within axon For this comparison you select “Condition: C9-1”, “Treatment: None”, “Compartment: axon” versus “Condition: C9-1iso”, “Treatment: None”, “Compartment: axon”

  • Comparison 2: C9-2 vs C9-2iso within axon For this comparison you select “Condition: C9-2”, “Treatment: None”, “Compartment: axon” versus “Condition: C9-2iso”, “Treatment: None”, “Compartment: axon”

  • Comparison 3: C9 vs C9iso within axon For this comparison you select “Group: C9”, “ALS mutation: Yes”, “Treatment: None”, “Compartment: axon” versus “Group: C9”, “ALS mutation: No”, “Treatment: None”, “Compartment: axon”

Note: instead of “Group: C9”, “ALS mutation: Yes”, you could also select “Condition: C9-1 + C9-2”

Log fold change is calculated with reference to “iso” groups in DE comparisons

## load libraries
library("dplyr")
library("tidyr")
library("tibble")
library("ggplot2")
library("ggrepel")
library("viridis")
library("DESeq2")
library("RColorBrewer")
library("pheatmap")
library("readr")
library("readxl")
library("writexl")
library("janitor")
library("gridExtra")
library("biomaRt")
library("DT")

## set project variables
proj <- "Within_Axon"
sel_comp <- "axon"
ann_version <- 2
c1 <- "C9_1iso vs C9_1"
c2 <- "C9_2iso vs C9_2"
c3 <- "C9_iso vs C9"

Experimental design

We load the complete Axonseq count matrix of all samples and metadata describing the conditions of these samples. The metadata and matrix counts are subset to contain only Within Axon samples. The metadata (or “colData”) of the samples being compared is displayed in an HTML table below:

## load metadata
readRDS(file = paste0("./data/Jimmy_axonseq_project_metadata_20231129.rds")) %>%
  dplyr::filter(!sample %in% c("P19509_2001", "P19509_2039", "P19509_2051"),
                ## no outliers
                compartment == sel_comp) %>% ## only axon samples
  dplyr::rename(genotype = group) %>%
  dplyr::mutate(group = dplyr::case_when(
    grepl("iso", condition)   ~ "C9_iso",
    grepl("_CTRL", condition) ~ "C9_ASO_CTRL",
    grepl("_MO", condition)   ~ "C9_ASO",
    grepl("C9", condition)    ~ "C9",
    grepl("TDP43", condition) ~ "TDP43",
    grepl("FUS", condition)   ~ "FUS")) %>%
  dplyr::filter(group %in% c("C9", "C9_iso")) %>% ## only non-ASO, non-TDP, non-FUS samples
  dplyr::mutate_all(as.factor) %>%
  as.data.frame() -> meta_data
rownames(meta_data) <- meta_data$sample

## load count data
read.table(file = "./data/counts_JB_AXONseq.csv", sep = ";", header = TRUE) %>%
  dplyr::rename(genes = names(.)[1]) %>%
  dplyr::select(genes, meta_data$sample) %>% ## contains only samples in metadata
  tibble::column_to_rownames(var = "genes") %>%
  as.matrix() -> cts

## display metadata table
meta_data %>% DT::datatable()

What was not selected: * somal samples * samples P19509_2001, P19509_2039 and P19509_2051 were removed because they appear to be outliers * ASO, TDP and FUS samples

## by condition
dds <- DESeqDataSetFromMatrix(countData = cts, ## the raw gene counts
                              colData = meta_data, ## our sample metadata
                              design = ~ condition) ## our design

dds2 <- DESeqDataSetFromMatrix(countData = cts, ## the raw gene counts
                              colData = meta_data, ## our sample metadata
                              design = ~ group) ## our design

cat("Number of genes in dds object *before* filtering out rows with zero counts \n")
#> Number of genes in dds object *before* filtering out rows with zero counts
dim(assay(dds))[1]
#> [1] 57131

For QC reasons, genes with counts of all 0 for all samples (i.e. rows in the count matrix with no count measurements) are removed before running DESeq. This is because for rows with all zero counts no variance can be modeled.

keep <- rowSums(counts(dds)) > 0
dds <- DESeq(dds[keep, ])

keep2 <- rowSums(counts(dds2)) > 0
dds2 <- DESeq(dds2[keep2, ])
cat("Number of genes in dds object *after* filtering out rows with zero counts \n")
#> Number of genes in dds object *after* filtering out rows with zero counts
dim(assay(dds))[1]
#> [1] 32040

Results tables are generated using the function results, which extracts a results table with log2 fold changes, p-values and adjusted p-values. With no additional arguments to results, the log2 fold change and Wald test p-value will be for the first variable in the design formula, the experiment group will be the last variable.

res1 <- results(dds, contrast = c("condition", "C9_1iso", "C9_1"), alpha = 0.05)
res2 <- results(dds, contrast = c("condition", "C9_2iso", "C9_2"), alpha = 0.05)
res3 <- results(dds2, contrast = c("group", "C9_iso", "C9"), alpha = 0.05)

Data exploration and quality assessment (QC)

Principal components analysis (PCA)

PCA is a method of visually identifying the similarity or difference between samples. PCA rotates the data cloud onto an orthogonal basis determined by the dimensions of maximal variance. The first two Principal Components (PCs) usually hold the majority of the variance of the data. The following plots show the variance stabilized transformed count matrix samples projected onto the two largest Principal Components (i.e. PC1 and PC2). DESeq2 recommends two types of PCA stabilizations applied to the PCA: * vst “variance stabilizing transformation” * rld “regularized log transformation”

## perform variance stabilizing transformation and PCA and plot with ggplot2
rld <- DESeq2::vst(dds)
pcaData <- plotPCA(rld, intgroup = c("condition", "compartment"),
                   returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
pcaData %>% 
  ggplot(aes(PC1, PC2, color = condition, shape = compartment)) +
  geom_point(size = 3) +
  geom_text_repel(aes(label = colnames(rld)), force = 5,
                  arrow = arrow(length = unit(0.03, "npc"),
                                type = "closed", ends = "first")) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(),
        axis.ticks.y = element_blank(), axis.text.y = element_blank()) +
  coord_fixed() +
  ggtitle(paste0("RLD PCA ", gsub("_", " ", sel_comp)))

Note: we observe that the rld and vst PCA stabilization transformations yield very different projections.

## perform regularized log transformation and PCA and plot with ggplot2
rld <- rlogTransformation(dds)
pcaData <- plotPCA(rld, intgroup = c("condition", "compartment"),
                   returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
pcaData %>%
  ggplot(aes(PC1, PC2, color = condition, shape = compartment)) +
  geom_point(size = 3) +
  geom_text_repel(aes(label = colnames(rld)), force = 5,
                  arrow = arrow(length = unit(0.03, "npc"),
                                type = "closed", ends = "first")) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(),
        axis.ticks.y = element_blank(), axis.text.y = element_blank()) +
  coord_fixed() +
  ggtitle(paste0("VST PCA ", gsub("_", " ", sel_comp)))

Size factors

Size factors are a method of normalizing used by the DESeq function to normalize the data in terms of sequencing depth. Size factor is the median ratio of the sample over a pseudosample: for each gene, the geometric mean of all samples. Size factors account for differences in sequencing depth are typically centered around 1 (indicating comparable sequencing depth).

dds$sizeFactor %>%
  as.data.frame() %>% 
  tibble::rownames_to_column(var = "sample") %>% 
  dplyr::rename(size_factors = names(.)[2]) %>% 
  dplyr::left_join(meta_data, by = "sample") %>% 
  ggplot(aes(x = sample, y = size_factors, fill = condition)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 4)) +
  geom_hline(yintercept = mean(dds$sizeFactor), color = "red", linetype = "dashed")

Dispersion plot

DESeq2 estimates gene dispersion using an algorithm that first generates gene-wise maximum likelihood estimates (MLEs) that are obtained using only the respective gene’s data (black dots). Then, a curve (red) is fit to the MLEs to capture the overall trend of dispersion-mean dependence. This fit is used as a prior mean for a second estimation round, which results in the final maximum a priori (MAP) estimates of dispersion. This results in a “shrinkage” of the noisy gene-wise estimates toward the consensus represented by the red line. The black points circled in blue are detected as dispersion outliers and not shrunk toward the prior (shrinkage would follow the dotted line). A more in-depth theoretical explanation of the DESeq2 algorithm can be found here

plotDispEsts(dds, main = "DESeq2 Dispersion plot")

Quality assessment (QC)

Independent Filtering

DESeq2 performs independent filtering by default using the mean of normalized counts as a filter statistic. A threshold on the filter statistic (first value) is found which optimizes the number of adjusted p-values lower than significance level alpha. The adjusted p-values for the genes which do not pass the filter threshold are set to NA. The results also returns the mean of normalized counts (second value).

cat(paste0("Filter threshold value and mean of normalized counts ", c1, "\n"))
#> Filter threshold value and mean of normalized counts C9_1iso vs C9_1
metadata(res1)$filterThreshold
#> 60.10204% 
#>  1.959663
cat(paste0("Filter threshold value and mean of normalized counts ", c2, "\n"))
#> Filter threshold value and mean of normalized counts C9_2iso vs C9_2
metadata(res2)$filterThreshold
#>        0% 
#> 0.0204094
cat(paste0("Filter threshold value and mean of normalized counts ", c3, "\n"))
#> Filter threshold value and mean of normalized counts C9_iso vs C9
metadata(res3)$filterThreshold
#> 68.3806% 
#> 2.250812

Plot of sample rejections vs filter quantiles

The filterThreshold returns the threshold chosen (vertical line in the plots below) by the DESeq2 analysis of the lowest quantile of the filter for which the number of sample rejections is within 1 residual standard deviation to the peak of a curve fit to the number of rejections over the filter quantiles. The following diagnostic plot shows the number of rejected samples (y-axis) plotted against quantiles of filter (x-axis).

par(mfrow = c(1, 1))
plot(metadata(res1)$filterNumRej, type = "b", main = c1,
     xlab = "Quantiles of filter", ylab = "Number of rejections")
lines(metadata(res1)$lo.fit, col = "red")
abline(v = metadata(res1)$filterTheta)

par(mfrow = c(1, 1))
plot(metadata(res2)$filterNumRej, type = "b", main = c2,
     xlab = "Quantiles of filter", ylab = "Number of rejections")
lines(metadata(res2)$lo.fit, col = "red")
abline(v = metadata(res2)$filterTheta)

par(mfrow = c(1, 1))
plot(metadata(res3)$filterNumRej, type = "b", main = c3,
     xlab = "Quantiles of filter", ylab = "Number of rejections")
lines(metadata(res3)$lo.fit, col = "red")
abline(v = metadata(res3)$filterTheta)

Histogram of frequency of p-values of results

The following plot shows the number of frequency of counts (y-axis) against p-values between 0 and 1 (x-axis).

par(mfrow = c(1, 1))
hist(res1$pvalue, col = "lavender", xlab = "p-values", main = c1)

par(mfrow = c(1, 1))
hist(res2$pvalue, col = "lavender", xlab = "p-values", main = c2)

par(mfrow = c(1, 1))
hist(res3$pvalue, col = "lavender", xlab = "p-values", main = c3)

Results

Summary of results

print(x = paste0(c1, " ",  gsub("_", " ", sel_comp)))
#> [1] "C9_1iso vs C9_1 axon"
print(summary(res1))
#> 
#> out of 32040 with nonzero total read count
#> adjusted p-value < 0.05
#> LFC > 0 (up)       : 30, 0.094%
#> LFC < 0 (down)     : 0, 0%
#> outliers [1]       : 2130, 6.6%
#> low counts [2]     : 18315, 57%
#> (mean count < 2)
#> [1] see 'cooksCutoff' argument of ?results
#> [2] see 'independentFiltering' argument of ?results
#> 
#> NULL
print(x = paste0(c2, " ",  gsub("_", " ", sel_comp)))
#> [1] "C9_2iso vs C9_2 axon"
print(summary(res2))
#> 
#> out of 32040 with nonzero total read count
#> adjusted p-value < 0.05
#> LFC > 0 (up)       : 5, 0.016%
#> LFC < 0 (down)     : 7, 0.022%
#> outliers [1]       : 2130, 6.6%
#> low counts [2]     : 0, 0%
#> (mean count < 0)
#> [1] see 'cooksCutoff' argument of ?results
#> [2] see 'independentFiltering' argument of ?results
#> 
#> NULL
print(x = paste0(c3, " ",  gsub("_", " ", sel_comp)))
#> [1] "C9_iso vs C9 axon"
print(summary(res3))
#> 
#> out of 31453 with nonzero total read count
#> adjusted p-value < 0.05
#> LFC > 0 (up)       : 121, 0.38%
#> LFC < 0 (down)     : 0, 0%
#> outliers [1]       : 0, 0%
#> low counts [2]     : 21909, 70%
#> (mean count < 2)
#> [1] see 'cooksCutoff' argument of ?results
#> [2] see 'independentFiltering' argument of ?results
#> 
#> NULL

Results Annotation

We annotate the DE results with human Ensembl and Entrez IDs by accessing the BioMart database. The saved .RDS object of the biomaRt query results are loaded.

## annotate DE results with Biomart database
# listAttributes(mart = useDataset("mmusculus_gene_ensembl", useMart("ensembl")))
# getBM(attributes = c("ensembl_gene_id", "entrezgene_id", "external_gene_name"),
#       mart = useDataset("hsapiens_gene_ensembl", useMart("ensembl"))) %>%
#   dplyr::rename(ensembl_id = ensembl_gene_id,
#                 entrez_id = entrezgene_id,
#                 genes = external_gene_name) %>%
#   dplyr::filter(stringr::str_length(genes) > 1,
#                 !duplicated(ensembl_id)) -> human_biomart
# saveRDS(object = human_biomart, file = "./data/human_biomart.rds")
human_biomart <- readRDS(file = "./data/human_biomart.rds")

Marker genes

The lists of up and downregulated genes in C9orf72 MNs (bulk-RNA-seq) that from [Selvaraj, B.T, Livesey, M. R et al., 2018][https://pubmed.ncbi.nlm.nih.gov/29367641/] will be used to compare enriched DE genes in our data.

  • List of genes that are significantly upregulated in C9ORF72 mutant motor neurons
# Selvaraj, B.T, Livesey, M. R et al.,                      
# Supplementary data -1                     
readxl::read_excel("~/Documents/tmp/dacruz/202404_jimmy/data/41467_2017_2729_MOESM3_ESM.xlsx",
                   skip = 3) %>%
  dplyr::rename(genes = gene_name) %>%
  dplyr::mutate(C9_upreg_marker = "up") %>%
  as.data.frame() -> df_up
names(df_up) <- gsub("-", "_", names(df_up))

df_up %>% DT::datatable()
  • List of genes that are significantly down regulated in C9ORF72 mutant motor neurons
# Selvaraj, B.T, Livesey, M. R et al., 
# Supplementary data: 2
readxl::read_excel("~/Documents/tmp/dacruz/202404_jimmy/data/41467_2017_2729_MOESM4_ESM.xlsx",
                   skip = 3) %>%
    dplyr::rename(genes = gene_name) %>%
    dplyr::mutate(C9_downreg_marker = "down") %>%
  as.data.frame() -> df_down
names(df_down) <- gsub("-", "_", names(df_down))

df_down %>% DT::datatable()

The annotated DE results are arranged by p-adjusted value. Normalized counts from the are added to the right of the DE statistics.

## RES1
merge(as.data.frame(res1), as.data.frame(counts(dds, normalized = TRUE)),
      by = "row.names", sort = FALSE) %>%
  dplyr::rename(genes = names(.)[1]) %>%
  dplyr::left_join(human_biomart, by = "genes") %>%
  dplyr::mutate(
    non_zero_de = rowSums(dplyr::select(., contains("P19")) != 0),
    perc_non_zero_de = non_zero_de / nrow(meta_data)) %>%
  dplyr::filter(!duplicated(genes)) %>%
  dplyr::left_join(df_up %>% dplyr::select(c(genes, contains("marker"))), by = "genes") %>%
  dplyr::left_join(df_down %>% dplyr::select(c(genes, contains("marker"))), by = "genes") %>%
  dplyr::select(genes, contains("_id"), contains("_marker"), contains("non_zero"),
                everything()) %>%
  dplyr::arrange(padj) %>%
  as.data.frame() -> res_df1

## readjust sig values
# res_df1$pvalue[res_df1$pvalue == 0] <- 1e-300
# res_df1$padj[res_df1$padj == 0] <- 1e-300

## RES2
merge(as.data.frame(res2), as.data.frame(counts(dds, normalized = TRUE)),
      by = "row.names", sort = FALSE) %>%
  dplyr::rename(genes = names(.)[1]) %>%
  dplyr::left_join(human_biomart, by = "genes") %>%
  dplyr::mutate(
    non_zero_de = rowSums(dplyr::select(., contains("P19")) != 0),
    perc_non_zero_de = non_zero_de / nrow(meta_data)) %>%
  dplyr::filter(!duplicated(genes)) %>%
  dplyr::left_join(df_up %>% dplyr::select(c(genes, contains("marker"))), by = "genes") %>%
  dplyr::left_join(df_down %>% dplyr::select(c(genes, contains("marker"))), by = "genes") %>%
  dplyr::select(genes, contains("_id"), contains("_marker"), contains("non_zero"),
                everything()) %>%
  dplyr::arrange(padj) %>%
  as.data.frame() -> res_df2

## readjust sig values
# res_df2$pvalue[res_df2$pvalue == 0] <- 1e-300
# res_df2$padj[res_df2$padj == 0] <- 1e-300

## RES3
merge(as.data.frame(res3), as.data.frame(counts(dds, normalized = TRUE)),
      by = "row.names", sort = FALSE) %>%
  dplyr::rename(genes = names(.)[1]) %>%
  dplyr::left_join(human_biomart, by = "genes") %>%
  dplyr::mutate(
    non_zero_de = rowSums(dplyr::select(., contains("P19")) != 0),
    perc_non_zero_de = non_zero_de / nrow(meta_data)) %>%
  dplyr::filter(!duplicated(genes)) %>%
  dplyr::left_join(df_up %>% dplyr::select(c(genes, contains("marker"))), by = "genes") %>%
  dplyr::left_join(df_down %>% dplyr::select(c(genes, contains("marker"))), by = "genes") %>%
  dplyr::select(genes, contains("_id"), contains("_marker"), contains("non_zero"),
                everything()) %>%
  dplyr::arrange(padj) %>%
  as.data.frame() -> res_df3

## readjust sig values
# res_df2$pvalue[res_df2$pvalue == 0] <- 1e-300
# res_df2$padj[res_df2$padj == 0] <- 1e-300

DE visualizations

MA plots

A MA plot illustrates log-fold expression change between two groups of samples, created by transforming and the data onto two scales: M (the log of the ratio of level counts for each gene between two samples) and A (the average level counts for each gene across the two samples) scales. MA plots demonstrates the difference between samples in terms of signal intensities of read counts. In this type of plot, genes with similar expression levels in two samples will appear around the horizontal line y = 0 (red line). The following MA plot illustrates log-fold expression change for each comparison after the DESeq2 analysis. Significant genes (P < 0.05) are highlighted in red.

results1 <- as.data.frame(res_df1)
results1[is.na(results1)] <- 0.99 # change NA results to 0.99 for correct MAplot
results1 %>%
  ggplot(aes(x = baseMean, y = log2FoldChange)) +
  geom_point(aes(colour = padj < 0.05), size = 0.5) +
  scale_colour_manual(name = 'padj < 0.05',
                      values = setNames(c('red','black'), c(TRUE, FALSE))) +
  scale_x_continuous(trans = "log10", limits = c(0.1, 300000)) +
  geom_smooth(colour = "red") +
  geom_abline(slope = 0, intercept = 0, colour = "blue") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("baseMean (A)") + ylab("log2FoldChange (M)") +
  ggtitle(paste0(c1, " in ",  gsub("_", " ", sel_comp)))

results2 <- as.data.frame(res_df2)
results2[is.na(results2)] <- 0.99 # change NA results to 0.99 for correct MAplot
results2 %>%
  ggplot(aes(x = baseMean, y = log2FoldChange)) +
  geom_point(aes(colour = padj < 0.05), size = 0.5) +
  scale_colour_manual(name = 'padj < 0.05',
                      values = setNames(c('red','black'), c(TRUE, FALSE))) +
  scale_x_continuous(trans = "log10", limits = c(0.1, 300000)) +
  geom_smooth(colour = "red") +
  geom_abline(slope = 0, intercept = 0, colour = "blue") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("baseMean (A)") + ylab("log2FoldChange (M)") +
  ggtitle(paste0(c2, " in ",  gsub("_", " ", sel_comp)))

results3 <- as.data.frame(res_df3)
results3[is.na(results3)] <- 0.99 # change NA results to 0.99 for correct MAplot
results3 %>%
  ggplot(aes(x = baseMean, y = log2FoldChange)) +
  geom_point(aes(colour = padj < 0.05), size = 0.5) +
  scale_colour_manual(name = 'padj < 0.05',
                      values = setNames(c('red','black'), c(TRUE, FALSE))) +
  scale_x_continuous(trans = "log10", limits = c(0.1, 300000)) +
  geom_smooth(colour = "red") +
  geom_abline(slope = 0, intercept = 0, colour = "blue") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("baseMean (A)") + ylab("log2FoldChange (M)") +
  ggtitle(paste0(c3, " in ",  gsub("_", " ", sel_comp)))

Volcano plots

A volcano plot is a type of scatter plot that is used to quickly identify genes that display large magnitude changes that are also statistically significant. A volcano plot is constructed by plotting the negative log of the p-value on the y-axis. This results in data points with low p-values appearing toward the top of the plot. The x-axis is the log of the fold change between the two experimental conditions. The log of the fold change is used so that changes in both directions appear equidistant from the center. Plotting points in this way results in two regions of interest in the plot: those points that are found toward the top of the plot that are far to either the left- or right-hand sides. These represent values that display large magnitude fold changes (on the left or right of center) as well as high statistical significance (toward the top). The following Volcano plot shows log of the fold change and negative log of the p-values for each comparison. Significant genes (P < 0.05) with log2 fold change (> 1) are highlighted in red.

res_df1 %>%
  # dplyr::filter(!is.na(pvalue),
  #               perc_non_zero_de > 0.66) %>%
  dplyr::mutate(threshold = as.factor(abs(log2FoldChange) > 1 & pvalue < 0.05),
                sig_group = as.factor(dplyr::case_when(
                  log2FoldChange > 1 & pvalue < 0.05 ~ !!gsub("vs.*", "", c1),
                  log2FoldChange < -1 & pvalue < 0.05 ~ !!gsub(".*vs", "", c1),
                  TRUE ~ "not significant"))) %>%
  ggplot(aes(x = log2FoldChange, y = -log10(pvalue), color = sig_group)) +
  geom_point(alpha = 0.75, size = 0.75) +
  geom_hline(yintercept = -log10(0.05), color = "red", linetype = "dashed") +
  geom_vline(xintercept = -1, color = "red", linetype = "dashed") +
  geom_vline(xintercept = 1, color = "red", linetype = "dashed") +
  xlab("log2 fold change") +
  ylab("-log10 p-value") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_manual(values = c("magenta", "limegreen", "gray3")) +
  ggtitle(paste0(c1, " in ",  gsub("_", " ", sel_comp)))

res_df2 %>%
  # dplyr::filter(!is.na(pvalue),
  #               perc_non_zero_de > 0.66) %>%
  dplyr::mutate(threshold = as.factor(abs(log2FoldChange) > 1 & pvalue < 0.05),
                sig_group = as.factor(dplyr::case_when(
                  log2FoldChange > 1 & pvalue < 0.05 ~ !!gsub("vs.*", "", c2),
                  log2FoldChange < -1 & pvalue < 0.05 ~ !!gsub(".*vs", "", c2),
                  TRUE ~ "not significant"))) %>%
  ggplot(aes(x = log2FoldChange, y = -log10(pvalue), color = sig_group)) +
  geom_point(alpha = 0.75, size = 0.75) +
  geom_hline(yintercept = -log10(0.05), color = "red", linetype = "dashed") +
  geom_vline(xintercept = -1, color = "red", linetype = "dashed") +
  geom_vline(xintercept = 1, color = "red", linetype = "dashed") +
  xlab("log2 fold change") +
  ylab("-log10 p-value") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_manual(values = c("magenta", "limegreen", "gray3")) +
  ggtitle(paste0(c2, " in ",  gsub("_", " ", sel_comp)))

res_df3 %>%
  # dplyr::filter(!is.na(pvalue),
  #               perc_non_zero_de > 0.66) %>%
  dplyr::mutate(threshold = as.factor(abs(log2FoldChange) > 1 & pvalue < 0.05),
                sig_group = as.factor(dplyr::case_when(
                  log2FoldChange > 1 & pvalue < 0.05 ~ !!gsub("vs.*", "", c3),
                  log2FoldChange < -1 & pvalue < 0.05 ~ !!gsub(".*vs", "", c3),
                  TRUE ~ "not significant"))) %>%
  ggplot(aes(x = log2FoldChange, y = -log10(pvalue), color = sig_group)) +
  geom_point(alpha = 0.75, size = 0.75) +
  geom_hline(yintercept = -log10(0.05), color = "red", linetype = "dashed") +
  geom_vline(xintercept = -1, color = "red", linetype = "dashed") +
  geom_vline(xintercept = 1, color = "red", linetype = "dashed") +
  xlab("log2 fold change") +
  ylab("-log10 p-value") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_manual(values = c("magenta", "limegreen", "gray3")) +
  ggtitle(paste0(c3, " in ",  gsub("_", " ", sel_comp)))

Unfiltered top DEG count plots

Count plots are created for the top 5 unfiltered DE genes for each comparison.

res_df1 %>%
  dplyr::filter(!duplicated(genes)) %>%
  dplyr::slice(1:5) %>%
  dplyr::select(genes, tidyr::contains("P19")) %>%
  tidyr::gather(key = "sample", value = "count", -genes) %>%
  dplyr::left_join(meta_data, by = "sample") %>%
  dplyr::mutate(genes = factor(genes, levels = unique(genes))) %>%
  ggplot(aes(x = sample, y = log10(count), color = condition)) +
  geom_point() +
  theme(axis.text.x = element_blank()) +
  ggtitle(paste0("Gene counts of top 5 genes for ",
                 c1, " in ", gsub("_", " ", sel_comp))) +
  facet_grid(~genes)

res_df2 %>%
  dplyr::filter(!duplicated(genes)) %>%
  dplyr::slice(1:5) %>%
  dplyr::select(genes, tidyr::contains("P19")) %>%
  tidyr::gather(key = "sample", value = "count", -genes) %>%
  dplyr::left_join(meta_data, by = "sample") %>%
  dplyr::mutate(genes = factor(genes, levels = unique(genes))) %>%
  ggplot(aes(x = sample, y = log10(count), color = condition)) +
  geom_point() +
  theme(axis.text.x = element_blank()) +
  ggtitle(paste0("Gene counts of top 5 genes for ",
                 c2, " in ", gsub("_", " ", sel_comp))) +
  facet_grid(~genes)

res_df3 %>%
  dplyr::filter(!duplicated(genes)) %>%
  dplyr::slice(1:5) %>%
  dplyr::select(genes, tidyr::contains("P19")) %>%
  tidyr::gather(key = "sample", value = "count", -genes) %>%
  dplyr::left_join(meta_data, by = "sample") %>%
  dplyr::mutate(genes = factor(genes, levels = unique(genes))) %>%
  ggplot(aes(x = sample, y = log10(count), color = group)) +
  geom_point() +
  theme(axis.text.x = element_blank()) +
  ggtitle(paste0("Gene counts of top 5 genes for ",
                 c3, " in ", gsub("_", " ", sel_comp))) +
  facet_grid(~genes)

Filtered top DEG count plots

Count plots are created for the top 5 DE genes for each comparison, filtered by perc_non_zero_de > 0.66, or in other words, all samples contain 66.6% “complete cases” in terms of expression for each gene.

res_df1 %>%
  dplyr::filter(!duplicated(genes),
                perc_non_zero_de > 0.66) %>%
  dplyr::slice(1:5) %>%
  dplyr::select(genes, tidyr::contains("P19")) %>%
  tidyr::gather(key = "sample", value = "count", -genes) %>%
  dplyr::left_join(meta_data, by = "sample") %>%
  dplyr::mutate(genes = factor(genes, levels = unique(genes))) %>%
  ggplot(aes(x = sample, y = log10(count), color = condition)) +
  geom_point() +
  theme(axis.text.x = element_blank()) +
  ggtitle(paste0("Gene counts of top 5 genes for ",
                 c1, " in ", gsub("_", " ", sel_comp))) +
  facet_grid(~genes)

res_df2 %>%
  dplyr::filter(!duplicated(genes),
                perc_non_zero_de > 0.66) %>%
  dplyr::slice(1:5) %>%
  dplyr::select(genes, tidyr::contains("P19")) %>%
  tidyr::gather(key = "sample", value = "count", -genes) %>%
  dplyr::left_join(meta_data, by = "sample") %>%
  dplyr::mutate(genes = factor(genes, levels = unique(genes))) %>%
  ggplot(aes(x = sample, y = log10(count), color = condition)) +
  geom_point() +
  theme(axis.text.x = element_blank()) +
  ggtitle(paste0("Gene counts of top 5 genes for ",
                 c2, " in ", gsub("_", " ", sel_comp))) +
  facet_grid(~genes)

res_df3 %>%
  dplyr::filter(!duplicated(genes),
                perc_non_zero_de > 0.66) %>%
  dplyr::slice(1:5) %>%
  dplyr::select(genes, tidyr::contains("P19")) %>%
  tidyr::gather(key = "sample", value = "count", -genes) %>%
  dplyr::left_join(meta_data, by = "sample") %>%
  dplyr::mutate(genes = factor(genes, levels = unique(genes))) %>%
  ggplot(aes(x = sample, y = log10(count), color = group)) +
  geom_point() +
  theme(axis.text.x = element_blank()) +
  ggtitle(paste0("Gene counts of top 5 genes for ",
                 c3, " in ", gsub("_", " ", sel_comp))) +
  facet_grid(~genes)

Results tables

Below are three html tables displaying the results the DE comparisons. These tables are interactive and be queried for specific proteins or sorted by column.

Note: these are the results filtered for genes with perc_non_zero_de > 0.66. To view the full unfiltered DE results, please refer to the Excel spreadsheets.

Some other notes on how DESeq2 calculates p-values set to NA. Some values in the results table can be set to NA for one of the following reasons: * If within a row, all samples have zero counts, the baseMean column will be zero, and the log2 fold change estimates, p-value and adjusted p-value will all be set to NA. * If a row contains a sample with an extreme count outlier then the p-value and adjusted p-value will be set to NA. These outlier counts are detected by Cook’s distance. * If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p-value will be set to NA.

Results tables of filtered DEGs for each DE comparison

  • C9_1iso vs C9_1 in axon
res_df1 %>%
  dplyr::filter(perc_non_zero_de > 0.66) %>%
  dplyr::select(-c(contains("_id"))) %>%
  dplyr::mutate_at(vars(non_zero_de:stat), round, 3) %>%
  dplyr::mutate_at(vars(matches("rep")), round, 3) %>%
  DT::datatable()
  • C9_2iso vs C9_2 in axon
res_df2 %>%
  dplyr::filter(perc_non_zero_de > 0.66) %>%
  dplyr::select(-c(contains("_id"))) %>%
  dplyr::mutate_at(vars(non_zero_de:stat), round, 3) %>%
  dplyr::mutate_at(vars(matches("rep")), round, 3) %>%
  DT::datatable()
  • C9_iso vs C9 in axon
res_df3 %>%
  dplyr::filter(perc_non_zero_de > 0.66) %>%
  dplyr::select(-c(contains("_id"))) %>%
  dplyr::mutate_at(vars(non_zero_de:stat), round, 3) %>%
  dplyr::mutate_at(vars(matches("rep")), round, 3) %>%
  DT::datatable()

Save Results

Results are saved as Excel .xslx and .rds files in two versions of each DEG list:

  1. Unfiltered DEG lists (with dropouts)
  2. DEG lists filtered with 66.6% non-zero gene expression values

These Excel files have the following columns:

  • genes = Hugo gene symbol (official gene name)
  • entrez_id = Entrez gene ID
  • ensembl_id = Ensembl gene ID
  • non_zero_de = number of non-zero values across gene rows
  • perc_non_zero_de = percent non-zero values across gene rows out of all samples
  • baseMean = DESeq2 average mean count per gene
  • log2FoldChange = log2 normalized fold change between two DE conditions
  • lfcSE = standard Normal distribution to generate a two-tailed p-value
  • stat = the difference in deviance between the reduced model and the full model, which is compared to a chi-squared distribution to generate a pvalue
  • pvalue = p-value
  • padj = Benjamini-Hochberg FDR p-adjusted value
  • C9_upreg_marker = upregulated in Selvaraj, B.T, Livesey, M. R et al., 2018 mutant C9 motor neurons
  • C9_downreg_marker = downregulated in Selvaraj, B.T, Livesey, M. R et al., 2018 mutant C9 motor neurons

Note: Cook’s Distance could not be calculated because we have condition groups of less than 4 samples.

## populate list of DE results
de_list <- list(res_df1,
                res_df1 %>% dplyr::filter(perc_non_zero_de > 0.66),
                res_df2,
                res_df2 %>% dplyr::filter(perc_non_zero_de > 0.66),
                res_df3,
                res_df3 %>% dplyr::filter(perc_non_zero_de > 0.66))

## name list headers to become sheet names
names(de_list) <- c(paste0(c1, " in ", sel_comp, " unfiltered DEGs"),
                    paste0(c1, " in ", sel_comp, " filtered DEGs"),
                    paste0(c2, " in ", sel_comp, " unfiltered DEGs"),
                    paste0(c2, " in ", sel_comp, " filtered DEGs"),
                    paste0(c3, " in ", sel_comp, " unfiltered DEGs"),
                    paste0(c3, " in ", sel_comp, " filtered DEGs"))

## save as Excel and RDS of DEG lists
saveRDS(object = de_list, file = paste0(
  "./data/", proj, "_DEG_res_v", ann_version, "_", Sys.Date(), ".rds"))
writexl::write_xlsx(x = de_list, path =  paste0(
  "./data/", proj, "_DEG_res_v", ann_version, "_", Sys.Date(), ".xlsx"))

Session Info

sessionInfo()
#> R version 4.3.3 (2024-02-29)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Sonoma 14.4.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/Brussels
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] DT_0.33                     biomaRt_2.58.2             
#>  [3] gridExtra_2.3               janitor_2.2.0              
#>  [5] writexl_1.5.0               readxl_1.4.3               
#>  [7] readr_2.1.5                 pheatmap_1.0.12            
#>  [9] RColorBrewer_1.1-3          DESeq2_1.42.1              
#> [11] SummarizedExperiment_1.32.0 Biobase_2.62.0             
#> [13] MatrixGenerics_1.14.0       matrixStats_1.2.0          
#> [15] GenomicRanges_1.54.1        GenomeInfoDb_1.38.8        
#> [17] IRanges_2.36.0              S4Vectors_0.40.2           
#> [19] BiocGenerics_0.48.1         viridis_0.6.5              
#> [21] viridisLite_0.4.2           ggrepel_0.9.5              
#> [23] ggplot2_3.5.0               tibble_3.2.1               
#> [25] tidyr_1.3.1                 dplyr_1.1.4                
#> 
#> loaded via a namespace (and not attached):
#>  [1] DBI_1.2.2               bitops_1.0-7            rlang_1.1.3            
#>  [4] magrittr_2.0.3          snakecase_0.11.1        compiler_4.3.3         
#>  [7] RSQLite_2.3.6           mgcv_1.9-1              png_0.1-8              
#> [10] vctrs_0.6.5             stringr_1.5.1           pkgconfig_2.0.3        
#> [13] crayon_1.5.2            fastmap_1.1.1           dbplyr_2.5.0           
#> [16] XVector_0.42.0          labeling_0.4.3          utf8_1.2.4             
#> [19] rmarkdown_2.26          tzdb_0.4.0              purrr_1.0.2            
#> [22] bit_4.0.5               xfun_0.43               zlibbioc_1.48.2        
#> [25] cachem_1.0.8            jsonlite_1.8.8          progress_1.2.3         
#> [28] blob_1.2.4              highr_0.10              DelayedArray_0.28.0    
#> [31] BiocParallel_1.36.0     parallel_4.3.3          prettyunits_1.2.0      
#> [34] R6_2.5.1                bslib_0.7.0             stringi_1.8.3          
#> [37] lubridate_1.9.3         jquerylib_0.1.4         cellranger_1.1.0       
#> [40] Rcpp_1.0.12             knitr_1.45              splines_4.3.3          
#> [43] Matrix_1.6-5            timechange_0.3.0        tidyselect_1.2.1       
#> [46] rstudioapi_0.16.0       abind_1.4-5             yaml_2.3.8             
#> [49] codetools_0.2-20        curl_5.2.1              lattice_0.22-6         
#> [52] withr_3.0.0             KEGGREST_1.42.0         evaluate_0.23          
#> [55] BiocFileCache_2.10.2    xml2_1.3.6              Biostrings_2.70.3      
#> [58] filelock_1.0.3          pillar_1.9.0            generics_0.1.3         
#> [61] RCurl_1.98-1.14         hms_1.1.3               munsell_0.5.1          
#> [64] scales_1.3.0            glue_1.7.0              tools_4.3.3            
#> [67] locfit_1.5-9.9          XML_3.99-0.16.1         grid_4.3.3             
#> [70] crosstalk_1.2.1         AnnotationDbi_1.64.1    colorspace_2.1-0       
#> [73] nlme_3.1-164            GenomeInfoDbData_1.2.11 cli_3.6.2              
#> [76] rappdirs_0.3.3          fansi_1.0.6             S4Arrays_1.2.1         
#> [79] gtable_0.3.4            sass_0.4.9              digest_0.6.35          
#> [82] SparseArray_1.2.4       farver_2.1.1            htmlwidgets_1.6.4      
#> [85] memoise_2.0.1           htmltools_0.5.8.1       lifecycle_1.0.4        
#> [88] httr_1.4.7              bit64_4.0.5

References:

Love, M.I., Huber, W., Anders, S. (2014) “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15:550. 10.1186/s13059-014-0550-8

Anders, Simon, and Wolfgang Huber. 2010. “Differential Expression Analysis for Sequence Count Data.” Genome Biology 11:R106. http://genomebiology.com/2010/11/10/R106.

Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research, 43(7), e47. doi: 10.1093/nar/gkv007.